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ABSTRACT 

In order to obtain conditional maximum likelihood 
estimates, the so-called conditioning estimates have to be 
calculated. In this paper a method is examined that does not 
calculate these constants exactly, but approximates them using Monte 
Carlo Markov Chains. As an example, the method is applied to the 
conditional estimation of both item and person parameters in the 
Rasch model. The key idea for this approach was developed by C, J. 
Geyer and E. A. Thompson (1992), who showed that, in the exponential 
family, a quantity that is proportional to the conditioning constant 
can be expressed as an expectation with respect to a certain 
distribution. Simulating from this distribution, an estimate of the 
proportional quantity can be obtained as the observed sample mean. 
Inserting this estimate into the conditional likelihood then allows 
one to maximize the approximate likelihood, as the proportionality 
constant does not depend on the parameters to be estimated. (Contains 
5 tables, 1 figure, and 11 references.) (Author/SLD) 
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Abstract 

In order to obtain conditional maximum likelihood estimates, the so-called 
conditioning constants have to be calculated. In this paper a method is exa- 
mined that does not calculate these constants exactly, but approximates them 
using Monte Carlo Markov Chains. As an example, the method is applied to 
the conditional estimation of both item and person parameters in the Rasch 
model. 

Key words: CML estimation, Monte Carlo methods, Markov Chains, Rasch 
Model. 
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Introduction 

In this paper, an alternative calculation method is examined for conditional 
maximum likelihood estimation (CML) in item response models that belong to 
the exponential family. In order to obtain CML estimates, the conditioning con- 
stants have to be calculated. These constants may be difficult to compute. The 
method examined provides an alternative to calculating the constants exactly: 
they are approximated using Monte Carlo methods. The key idea for this ap- 
proximation was developed by Geyer and Thompson (1992). They showed that 
in the exponential family a quantity which is proportional to the conditioning 
constant can be expressed as an expectation with respect to a certain distribu- 
tion. Upon simulating from this distribution, an estimate of the proportional 
quantity can therefore be obtained as the observed sample mean. Inserting 
this estimate into the conditional likelihood then allows one to maximise the 
approximate likelihood, as the proportionality constant does not depend upon 
the parameters to be estimated. In the first section below the method will be 
explained in some detail. The next section will consist of a description of the 
simulation process: as the distributions from which to simulate may be rather 
complex, Monte Carlo Markov Chain methods, such as Hastings or Gibbs sam- 
pling may be necessary. The resulting estimation equations will be examined 
more closely in section 3. 

As an example, the method discussed will be applied to the conditional esti- 
mation of both item and person parameters in the Rasch Model (Rasch 1960). 
In section 4 estimates obtained using the above method will be compared to 
exact CML estimates. The results seem very acceptable. 
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A Monte Carlo method for CML estimation 

In this section it will be explained how a method developed by Geyer and 
Thompson (1992) can be applied to conditional maximum likelihood estimation 
in item response models. The method will be applied to the Rasch model. In 
order to make the theory to follow more understandable, the section will there- 
fore start with a description of the conditional Rasch model. 

The conditional Rasch Model 

Consider a test consisting of M items. Let Xj = x t with x = 0, 1 be a response 
to item j, and let the variable X = x be the A4-vector of responses to the entire 
test. The total score T is defined as 

M 
> = 1 

Then for the Rasch model the conditional likelihood for one observation, i.e. 
response pattern, looks like 

Pr(X | T;M) = « r ya\ 

where 6 is a vector of difficulty parameters for the j items, and 9 denotes the 
latent ability. The summation in the denominator is over all possible response 
patterns with the same total score T = t. These denominators, there is one for 
each value of T, are known vs eie/nentary symmetric functions and they will be 
denoted in this paper by ~ft(6): 

7l («)= £ expC-^A'^). 

x t(x)=t 

With this notation the above formula can be rewritten as 

ex P(-Ej -Vj) 



Pr(X|T;M) = 



7i(«) 
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The corresponding log likelihood is given by 



\og L(S\T) 



= -log 7< (*) 



so that the log-likelihood for the whole sample becomes 



logL(* | T) = -£\ S j S j -J^Nt \og lt (S) 



where N t denotes the number of persons in the sample with T = t, S ; is the 
item total and T is the column vector of observed total scores. The solution 
equati< ns are obtained upon setting the partial derivatives of this equation with 
respect to the Sj's equal to zero. These partial derivatives are given by 



in which the numerator is equal to d-) t {S)ldbk, with T t J x {6) a gamma function 
for the set of items not containing item k. For example, if there were 3 items, 
we would have 



For the Rasch model, recursive formulae are available for calculating 7t(£); but 
for other IRT models this is not always the case. 

Approximating 7t(^): the Geyer and Thompson method 

Instead of calculating the gamma functions it is possible to approximate them 
with the help of Monte Carlo Markov Chain (MCMC) methods, using the fol- 
lowing idea developed by Geyer and Thompson (1992). Let the probability 




72 



exp - 6 2 ) + exp (-61 - £3) + exp (-6 2 - 63) 



exp {-6 2 -63). 
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density function for one observation in the conditional formulation of the Rasch 

Model be known as /<(X;£); then the corresponding conditional likelihood of 6 

as a function of X can be written as / t (£;X). Note that in this notation the 

conditioning variable T has moved from behind the bar to a subscript on /. 

Omitting the dependence of the likelihood on X we therefore have 

exp(— 5V Xi6j) 
f t (6) = Pr(X | T;M) = V \fy ' • 

Now if rl> were another set of parameters, then ft(i>) would be equal to 

ftW) = Pr(X | 7;0» = . 

Trivially, using the definition of j t (6) and multiplying by one, 

so that, moving one 7t(V>) from the right to the left hand side of the equals sign 
and using the definition formula of ft(il>) 



^5 " ^{-C^-*}- ,(*» 

x t(x)=< 



In other words, if we define 

then, if we were able to simulate a random sample of B observations' from 
f t (tl>) = /t(X; VO we could estimate all d t (SYs by the sample means: 

B 

d <W = A £ ex P 'W*; - </;)} 

6 = 1 
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for any value of 6. Note that the first subscript nn \', the t } is there to indicate 
that every simulated response vector X belongs to a ?et having common total 
score T. Defining log L* to be 

log L'(S | T; V) - -£; 6jSj - ]T> logdi(«), 

t 

note that 

t 

t t 

= \o g L(6\T) + J2 N <y<W 
t 

attains its maximum for the same value of 6 as does logL(6 | T). So we now 
can substitute d t (6) for d t (S) and maximise the resulting expression 

lo g r(«|T;V>) « -E,^-53iV|logd t («) 

= S J - E N ' lo 8 i^f=« ex P {"Si 'Wi " *i >} 

with respect to 6 to get an approximate solution to the original likelihood equ- 
ation. 

However, the density f t (X\tl>) depends on a similar constant as does /t(X; 6), 
i.e. on 7«(V0> an d ^ us constant is still unknown, or at least difficult to calculate; 
therefore it is not possible to simulate from /t(X;V0 in any direct way. The 
solution to this can be found in the use of Markov Chains and the Hastings 
algorithm, which will be described below. 
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Simulation of response patterns: the Hastings algorithm 

A Markov Chain is a sequence of realisations of a random variable A' with the 
property that 

Pr(A'ib = x k | A'i =xj A'*-, = xjfc-0 = ?v(X k = x k \ X k -i = xjb_i). 

Here the subscript k is used to denote the ordering of the sequence in time; and 
A' either may or may not be a vector valued variable; in this section I will not 
use boldface to distinguish between the two. In the Markov Chain context the 
value of X is often called the 'state' of A. The probabilities of going from one 
state to another in a Markov Chain can be represented in a matrix P, having 
as entries pij = Pr{Xi = i | A'<_i = j)\ hence the rows of P add up to one. 
The Markov Chain is said to be irreducible if it is possible to get from any state 
to any other state in a finite number of transitions. The states of irreducible 
Markov Chains on finite sets of values can be shown to follow a unique limiting 
or ergodic distribution; denoting this (discrete) distribution by it, it is given as 
the solution to 

TrP = 7T 

i.e. if the transitions are made according to P, then for large A r , Pr(A^+iV = i \ 
X k = j) = Ki, independent of the value of A'* (see e.g. Proth & Hillion 1990). 

For our calculations we need simulations from f t (X\t{>). Now imagine it 
would be possible to find a transition matrix P that has its limiting distribution 
7r equal to f t (X\tl>). Then we could start from any initial response pattern X*, 
and using the proper row wit h conditional probabilities from matrix P we could 
then simulate a new state for the response pattern, say X^+i (the subscript still 
indicates ordering in time), and so on; until, after A r successive state changes, 
we would have Xjb + /v ~ 7r = f t {X\tj>). Using this Monte Carlo Markov Chain 
sampling scheme gives us an estimator d t (6) that is asymptotically normally 



f i 
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distributed and approaches dt(6) in mean square as the number of Monte Carlo 
samples B — > oo (Hastings 1970). 

Of course, the problem is to find P from nP = 7r. In general, this is impos- 
sible. However, Metropolis et al. (1953) and Hastings (1970) found a way to 
sample from 7r without actually knowing P. Let A" be the present state of the 
sequence, and X f an alternative state. Then their algorithm is as follows: 

1. Define any convenient transition matrix Q 

2. Propose a new state, say A'', for the variable A, in our case the response 
pattern X, according to the probabilities in 'he relevant row of Q 



4. Accept the proposed state with probability a. 

The algorithm can be proved to work, i.e. to generate a sequence with the 
desired ergodic distribution, using the detailed balance* lemma. This lemma 
states the following: if for irreducible P it holds that 7r»p;j = *jPji ^ nen ff ls tne 
limiting distribution for P. Substituting qij for p, ; , this condition can be easily 
< 'lecked to hold in the above algorithm. 

Recalling that in our case n(x) represents /t(X; t/0 = exp A' ; 0j )/')t{il>) 

the denominators of 7r( A) and of tt(A'') in step 3 are equal. Thus the cleverness 
of the algorithm lies in the fact that in calculating X }* X )Q(\\X') tnere is no ncec * 
to calculate these denominators as they will cancel. 

The algorithm as presented above is known as the Metropolis algorithm; 
when a symmetric matrix Q is used it is called the Hastings algorithm. In 
that case also the terms involving Q will cancel and a reduces to a(A, A') = 
min |l, V( A'/ } A n °t- ner improvement, in the case of a vector valued variable 
A', would be not to propose a new state A'' ad random, but to consider changing 



3. Define 
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only one element of X at a time. If the variables in X are independent of each 
other, or depend only on a few other variables, many additional factors in * j 
will cancel. In this latter case, the generation of one new simulation consists of 
sequentially considering a new value for each of the variables in X in turn. This 
is also called a 'full scan'; therefore, after one full scan the next simulated value 
for X has been obtained. 
Recapitulation: 

1. We want to <k> conditional maximum likelihood estimation for the para- 
meters in the Rasch model. This means we need to know the values of the 
conditioning constants 

2. We want to estimate these constants by the method proposed by Geyer 
and Thompson. That means we need artificial data drawn from /t(X;t/>), 
where V' is an arbitrary point in the parameter space, 

3. To draw the samples from /«(X; V>) we want to use the Hastings algorithm. 

Assuming we have actually drawn the sample, we then proceed to maximise the 
following equation, in which d t (S) has been expanded in full: 

log L'(6 | T;V>) = -Ej *J S t - Ei "t log [^Ef=, exp {- £j A,^ - V; )}] 
with partial derivatives 

d log// _ _ c; „ Et x tbk exp { - Ej X,n{6j - H'j )} 

d6k ' ' E6exp{-E;A ( 6;(^-V.';)} 

In the next section the resulting estimation equations will be examined more 
closely. 
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Estimation 



The two main tasks to be performed for the implementation of the estimation 
method proposed in section 1 are the simulation of the Monte Carlo data, and 
the actual estimation. In this section I will comment on several aspects of the 
estimation equations. In particular I will compare the equations for the Monte 
Carlo estimation to those for ordinary CML estimation. But I will start with 
another description of the algorithm for generating response vectors. 

Generation of Monte Carlo response patterns 

Starting from the current realisation of a response vector X, the next realisation 
will be obtained after one 'full scan'. Recall that we are simulating from a distri- 
bution conditional on total score. Therefore we cannot change only one value 
Xj at a time: a new proposal state X' has to be obtained by interchanging the 
position of two different values in the response vector. A full scan then consists 
of the following steps (note that the subscript on X again denotes items instead 
of temporal ordering): 



2. if Xi — x % with x = 0 or 1, then randomly choose one of the items, say 
item j, with Xj = 1 - x 

3. the proposed state X' is the response vector with A',- = 1 - x and A ; = x 

4. accept this pioposed state with probability a = min jl, ^* 



5. if i < M t with M the number of items, then i t 4* I, and go back to 2; 
else stop: the next simulated response vector has been obtained. 

The newly arrived at simulated response vector will in turn serve as input 
for the next full scan etc. There may be need for a 'bum-in' period in the very 



l.i=l 
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beginning, in order to allow the algorithm to move away from a possibly badly 
chosen starting configuration X. 

Of course, the response vectors simulated in this way will not be completely 
independent. The autocorrelation could be reduced by inserting 2 or more scans 
between successive Monte Carlo simulations. But as its only influence will be on 
the variance of d t (6), it is equally well possible to use all the generated response 
patterns and to go on generating them until the variance is acceptable. 

Having obtained a 'fairly large number', B, of simulated response vectors it 
is possible to start estimating S. 

Starting values 

As starting values, i.e. first choice of the well known Gustafsson (1979) 
starting values are used: 

/ (0) Sj - Sj 

L^t-\ yv « W(Af-l) 

in which Sj is the item total Y^!i-\ an< ^ ^3 tne avera g e °f the item totals: 

If 6 is far away from t/>> the approximation of log L by logL* might not 
be too well at 6 = 6. Therefore it is probably wise to use the estimate S as 
a new rfc and repeat this preliminary procedure a couple of times before go,ng 
on to simulate a truly large sample that will be used for estimation. Geyer 
and Thompson also advise to use a restriction on the maximum steplength per 
iteration, but in view of the good starting values available for CML estimation 
in the Rasch model, I found there was no need for such a restriction. 

Identifiability 

To begin with, the usual constraint b. = 0 or b\ =0 has to be imposed. 
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Next, it is well known (Ford 1957) that in order for a solution to the ordinary 
CML equations to exist, all persons and items with perfect scores (T = 0 or 
T = M , and Sj = 0 or Sj = N) have to be deleted from the sample. Meaningful 
estimates for these persons and or items cannot be obtained, and deleting them 
will not influence the estimation of the remaining parameters. 

For the present estimation procedure this latter assertion remains yet to be 
investigated. As to the influence of perfect persons on item parameter estima- 
tion, we had 

log L" = -E; - E< H< log [££f=, exp {-Zi Xtvtfi - </>,)}] . 

It is evident that the term with T = 0 does not contribute to the function value, 
as all the A'^/s are equal to zero, so we get log(l/S x B) = log 1 = 0. Likewise, 
the term with T = M would have no contribution. Let there be k persons with 
total score T = M , then if wo would delete these persons we would have log L* 
equal to 



The terms with k £V ^ cancel, and although the resulting formula is not exactly 
equal to the original log-likelihood, the difference does not depend on the pa- 
rameters to be estimated, so it will result in the same estimates. For these 
reasons, persons with perfect response patterns can be safely omitted from the 
estimation of item parameters. 

Next, to the influence of perfect items on the estimation of other item para- 
meters. It seems plausible to use tp p = oo or t/'p = —oo for items with S p = 0 or 
S p = N respectively. Thus they will generate only perfect response patterns. It 
is easy to show that deleting those items from the loglikelihood will not influence 
the estimation equations for the other items. 
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Rank of the system of equations 

One topic clearly needs some more attention. As Sj is sufficient for Sj, in the 
context of ordinary CML there will only be as many estimation equations as 
there are different values of Sj. The situation is different for the Monte Carlo 
CML equations. Here we have 



and because of the Monte Carlo processes there is no guarantee that if Sk = Si, 
then also X tb k will be equal to A' t M, not even if you take V ; t = in- So without 
taking any precautions one would in this case end up with different estimators 
for items with the same value of the statistic. This is clearly an undesirable 
state of affairs. The easiest way out would seem to retain only one of the items 
with equal values on Sj in the analysis, but that would cause problems for the 
conditional Monte Carlo sampling scheme. Instead, I decided to average the 
estimation equations for items with equal values for Sj. This has implications 
for the way the equations can be written. If S* = 5/, then Sk will have to be 
equal to Si. To begin with, I therefore take t/ ; fc = fi< This will prove to be a 
convenient choice. Now, with Sk — Si, we have 



d\ogL 



= -S k +^ t N t 



r E& A ^* ex p{-E ; A ^;(*j-^)} 

u r- ^ <| 



J 



I...M, 



and 



E& Xtbi ex p{-EjA y <& j(<^ -t/', )} 
E 6 exp X tbj (Sj : - V»j)} 



Si = Ei N t 



1 , . . M 



so that Sk + S\ is equal to 
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r — : — : 1 — ■ 

If 0 t = V'/, and if we let V^jt' be equal to A'^jt + A'tw the above equation 
simplifies considerably. Likewise, we can do the same for all sets of items with 
the same value for 5, so that we get a different vector of observations, say Y, 
in which Yj is equal to a sum over several A' ; s. Now we can write 

SkM h =Y. N * r 1 1 r 1 ' j = l...M a ,h=l...M a 

where M a is the number of different values actually appearing for 6\ and A//, 
the number of items with S = s/». The above becomes particularly relevant 
when estimating abilities instead of difficulties, because usually the number of 
persons will be much larger than the number of items. As there only are M - 1 
different values for 7\ then only M — 1 equations will be necessary instead of 
N. 



Estimating abilities 

All the above applies equally well to estimation of 0 as to estimation of 6. The 
same routines can be used to maximise both sets of equations. The only cor- 
rection that has to be made is that the sign of the outcomes when estimating 
9 has to be reversed. It would have been fortunate if the same Monte Carlo 
data could be used for both estimation procedures. However, sampling under 
the condition of one fixed marginal (say for 7'), will change the other marginal, 
so this is not a feasible possibility. 

If both 0 and S are estimated using CML, they still have to be posit ioned on 
to a common scale. This can be achieved by first estimating 0 and 6 separately, 
and then setting 6* = 9 + frl, the latter being a vector of ones. Then optimise 
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the joint maximum likelihood for 6* and 6 with respect to k y c*nd finally take 



All function maximisations were carried out using the Fletcher- Reeves algo- 
rithm, in a a slightly modified form proposed by Polak and Ribiere (for details, 
see Press, Teukolsky et al 1992). 



This section gives some results obtained in testing the Hastings algorithm used 
for the simulation of response patterns. The last part of the section will compare 
the Monte Carlo estimates to exact CML estimates for 3 small real data sets. 



For some of the tests all possible response patterns have to be considered. The- 
refore some small data sets, preferably with known or previously estimated 
parameter values, were necessary. I used data provided by Thissen (1982). He 
reports the results of CML estimation on a 10 item memory test. This test was 
taken by 40 persons, 5 of which had a zero score, so there were 35 of them left 
for estimation. In addition he reanalyses two 5-item sections of the law school 
admissions test. These two subtests, which will be denoted as Isat6 and lsat7, 
were analysed earlier by Andersen and Madsen (1977) and by Bock and Lieber- 
man (1970). The data represent responses of 1000 subjects drawn from a larger 
sample of students applying for, admission to law schools at various universities 
in the United States. After omitting persons with perfect scores, 699 and 680 
respectively remained for analysis. 



as ycur estimates 9 + hi and 6. 



Results 



The data 
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Response pattern generation 

In order to test the algorithm for simulation of conditional response vectors, data 
were generated for a 5-item test with item difficulties equal to the starting va- 
lues for the lsat6. For the four non-perfect values of the total score 500 response 
vectors were generated from /t(X;^/>) using the Hastings algorithm described 
in section 1. As the number of different response patterns is quite small in this 
case (2 5 — 2 = 30), it was possible to calculate the theoretical conditional pro- 
babilities, i.e. the value of f t {X\ \l>) = exp(-£j Xj8j)/y t {il>)> for each pattern 
and to compare the observed frequencies for the generated response patterns to 
the expected ones. Next, a chi-square statistic \* 2 = £ (fobs - fexp) 2 /fexp was 
calculated for each conditional distribution /t(X; V0- This process was repeated 
with Monte Carlo sample sizes B equal to 1000 and 2000. The results are given 
in table 1. 

Insert table 1 about here 

None of the values in this table is significant, but two remarks apply. First, 
it is probably not really justifiable to perform a \ 7 goodness of fit test, because 
the generated response patterns are not completely independent, being subsequ- 
ent realisations under the Hastings algorithm. Therefore, the values should be 
interpreted with some care. And second, as a result of the randomness in the 
data simulation process, the generation of tables like table 1 is in this case itself 
a random process. So ideally the analysis should be repeated a large number of 
times, say a 1000, and the results studied. I did not do that; 1 only repeated 
it several times. Although sometimes significant \ 2 values appeared, which is 
to be expected, largely the pattern was as above. Therefore, although more 
rigourous tests still can be done, for the time being the conclude was that the 
response pattern generator works satisfactorily. 
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The estimator d t (6): sequential plots 

The next thing to investigate was the performance of the estimator d t (S). First, 

(6) 

some plots were constructed depicting the relationship between d t (S) and 6, 
where d t (S) ib) is the value for d t (S) as calculated from the first 6 (out of B) 
Monte Carlo simulations. Then dt{tif^ is plotted against 6, so that a kind of 
time series plot results. Now there are several factors which will influence the 
estimate c/ t (6) (6 \ Firstly, of course, there is the number of simulations 6 upon 
which it is based. Hopefully, with increasing 6, the estimate will become stable, 
i.e. converge to a certain value. In other words, d t (S) — d t (S) should go 
to zero for large b and any value of j. Next, it seems likely that the goodness of 
the estimate will be influenced by the shape of the distribution of / { (X;V0, as 
the empirical pmf of a sample from a 'regularly shaped' distribution in general 
will more closely resemble the shape of its parent than a sample of the same 
she from an irregularly shaped distribution. Thirdly, recall that d t (S) estimates 
7r(£)/7r(V>)» and that tne estimate will probably be better for 6 close to V>, and 
worse for S a large distance from V>- So the distance from S to ^» is a third 
factor that might influence the 'goodness' of the estimate. 

Ideally, what I should have done is use a fixed value for 6 % then generate 
response vectors at various points t/'t and finally examine the behaviour of 
d t (6) {b ^ for each value of V>- However, 1 decided to work the other way around: 

(6) 

use a fixed tfr and examine the behaviour of d t (S) for various possible values 
of S. The advantage of course is that now there is need to simulate only one 
data set instead of several ones. I wanted to look at estimates for 6 close to 
%l>, far away from V>< and in between. Arbitrarily, I decided that 'close' would 
mean | h } - V f > |< .1, Vj. As it would be rather artificial to have all 6j - V'j 
equal to each other, a S close to x{) was generated by sampling 6j from a uniform 
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distribution on (i/'j — .1, tf'j + 1) f° r eacn J- In a similar way 6j 's were sampled 
from ^(V'j - 3, 4'j + 3) to get a faraway point, and from t/(0j — 1, + 1) to 
get an intermediate point. Now for the faraway point not all the 6/s will in fact 
be far from the respective V'/s, but this choice will prove interesting. 

Having generated Monte Carlo data at rj) and having found an arbitrary value 

(6) 

for 6 in the required distance range, the sequential estimates for d t (6) were 
then calculated for 6 ranging from 1 to 2000. Informally repeating this several 
times with different values for the seed, it appeared that sometimes the results 
were as expected, and sometimes they were not. 

Insert Figure 1 about here 

To begin with an example of a nice result, figure 1 displays the plots for 
three arbitrarily chosen values of t for a ten item data set using as tj) the starting 
values for the memory data, for the largest distance of 3. Plots for B = 500 and 
B = 2000 are placed next to each other; the 500 simulations are the first 500 
from the 2000. After 500 simulations the estimates do not seem to have reached 
their equilibrium completely. After 2000 simulations the lines look smooth, but 
note that for example the line for t = 7 still seems to be increasing; also, the 
line for t = 5 still shows occasional small wiggles. However, bearing in mind 
that these are plots for S a large distance away from ^, 1 think these results are 
quite nice. It would be hard to say whether they are representative, though. 
Most plots for d % .1 (d indicating the distance) were better than these ones, 
some were similar; and also there were plots for d w 3 t hat were v/orse than 
these ones. But on the whole I am inclined to believe that these plots are fairly 
average for this 10 item data set. For a 5 item data set on the other hand 1 found 
that sometimes the plots were more fluctuating, even withe B — 2000. A closer 
inspection showed that this could happen because of extreme differences in the 
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values for Xj(6j — V'j)- The explanation is as follows. In estimating d t {6) 
two processes are involved: each new response pattern in first generated and 
then it is added to $^exp|— J^. - V'j)} (for convenience omitting the 

subscripts 6 and t on X t bj)\ this means each response pattern has a probability 
of occurring, depending only on / ( (X;^), and it has a particular value for 
]Cj Xj{6j — v /j, j)» depending on the differences 6- - ipj . Now problems are likely 
to occur, for some fixed 6, when there is a (or a few) response patterns with 
a very small probability of occurrence, and at the same time a comparatively 
large negative value for £V Xj(6j - tl>j)\ large, that is, compared to the value 

of Y,j Xjify - ) f° r otner X. Then exp {-52j -V?(*i ~" )} can Decome ver y 
large indeed; so that in our example every once in a while a value of 2097 would 
be added to the summation, whereas many other values with higher probabilities 
would be equal to .5 or .11. It would take perhaps a B of 100,000 simulations 
to level this out. This is why I said above that the way of generating the S's 
would prove interesting. The finding can be stated differently too: there is no 
problem when 6j — V'j is of about the same size for all j. Then all the terms 
in the summation over B will be of about equal size, even for different response 
patterns X. Problems can arise if there are one or more j for which 6j — r/'j is 
very large negative, compared to the other differences. No problems will occur 
in the reverse case, when there is a reasonable number of items and 6j — v'; is 
very small for some j compared to the rest. 

Having understood a possible source of erratic behaviour of the sequential 
plots, the question becomes: do we have to worry about it? To answer this 
question a simulation study was conducted which will be described in the next 
section. One adjustment of the statement in Geyer and Thompson however 
has to be made: in our case it is not the overall distance from 6 to if> which 
influences the goodness of the estimator d t {6) most: the spread in the distances 
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bj — 0j, combined with the probability distribution /t(X;t/>) is probably more 
important. 

The estimator d t {6): Accuracy of the approximation of d t (6) 
Sequential plots, as drawn in the previous section, can be very enlightening 
and instructive, especially when one happens to come across one that displays 
unexpected or unwanted behaviour. But for finding out something about the 
average behaviour of the estimator alternative means are needed. I conducted 
a simulation study which is algorithmically given by: 

1. Take a specific value for i/> 

2. Choose a distance, say d — .1 

3. Choose a value for 6 within that distance from t/> 

4. Calculate d t (S) for B = 500 

5. Calculate d t (6) for B = 2000 

6. Repeat steps 3-5 1000 times and calculate the average and standard devi- 
ation for d t {6). Compare this with the expected value 

7. Repeat steps 2-6 for distances of 1 and 3. 

Of course, the trends that are so nicely visible in the sequential plots will 
not appear in this way: looking only at d t (6) for B = 500 and 2000 provides one 
with a look at a 'fixed point in the plot' only. Moreover, calculating the mean 
and variance of dt(6) for 1000 replications might not be very instructive in itself, 
as the values of dt(6) should in the first place be compared to 7* (V J )i which 
is different in each of the 1000 replications because they each have a different 
value for 6. Therefore, denoting 7i(£)/7<(V>) by d t (6), for each of the 1000 
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replications, the relative difference {d t (6) - d t (S)}/d t (6) was calculated. This 
relative difference was the variable of interest in the present investigation; its 
mean, standard deviation, minimum and maximum are displayed in table 2 for 
B = 500 and B = 2000. 

Insert Table 2 about here 

Again, the starting values for the lsat6 data were used to provide me with 
a point t/>. Looking at the top half of the table first, we see that for B = 500 
the average relative error is very small: for 6 near to i/> it is .001 at most. The 
associated standard deviations vary from .02 to .05; and the maximum values 
for the minimum and maximum relative errors are equal to .09 and .1.0 resp. 
The worst values occur for t = 4. For the other distances the estimate behaves 
worst for t = 4 as well. In the case of a distance 1, the average error for t = 4 
is about 14%, and for d as 3 the average estimate for t = 4 is about 1.67 times 
as big as what it should be. Once it even was 16 times too big. Although, in 
my opinion, the averages for the other values of t do seem acceptable for d as .1 
and d as 1, the associated standard deviations are rather large for d as 1. 

The values for B = 2000 are, on the whole, strikingly similar to those for 
B = 500. I can only conclude that on average it does not seem to make much 
difference whether you use a Monte Carlo sample size of 500 or of 2000 in 
estimating d t (S): on average the estimates are very reasonable for S within a 
short distance of tl>: but standard deviations of about 5% may still occur. For 
6 far away from xj) the estimates are unreliable; and for S in between I arn 
inclined to say they are not utterly reliable either. Note, however, that for item 
parameters in the context of 1RT a difference of 1 is quite substantial, and in 
practice our starting values will probably be closer to the final estimates. 

A consistent pattern seems to be that the estimates are better for smaller 
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values of t\ this is probably due to the fact that for small t the summation 
Y2j — consists of fewer terms (many of the x, being equal to zero), 

so that the differences between values of exp Xji&j ~ ^j)} *° r different 

response patterns wi*h the same total score cannot become very large. Also, 
this table has been obtained using a very small data set, consisting of only 5 
items. It would be interesting to see whether the results would be similar with 
larger numbers of items: the sequential plots shown in the previous section were 
in general more irregular for a 5 item test than they were for a 10 item test. As 
mentioned, this might be due to the fact that the pmf of a five-item response 
vector is probably more irregular than the pmf of a ten-item response vector. 

Comparis on of MC estimates to exact CML estimates 

In this section results will be presented for the estimation of item and ability 
parameters for the Isat and memory data. The results are for one estimation 
only, the procedure has not been replicated to examine the behaviour of the 
estimators. Table 3 contains the Monte Carlo parameter estimates for the Isat 
data together with the exact CML estimates. 

Insert Table 3 about here 

The Monte Carlo estimates have been calculated under the constraint 6\ = 0, 
and after the estimation process was completed, the estimates were rescaled to 
have a mean of 0. The fourth column in the table contains the differences 
between both estimation procedures. The Monte Carlo estimates are within 1 
standard error from the exact values. Therefore my conclusion is that they are 
at least acceptable. 

The ability estimates are presented in the light hand side of the same table. 
In this case, the so-called 'exact 1 estimates are no CML estimates at all: the 
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ability estimates are calculated treating the item parameters as known. Note 
that, although the estimated standard errors are much larger for the ability 
estimates than they were for the delta's, the Monte Carlo estimates are within 
the same order of distance from the 'exact' values as they were for the item 
parameters. 

Insert Table 4 about here 

Table 4 contains the estimates for the memory test. The differences between 
Monte Carlo and exact estimates seem to be somewhat smaller than for the 
lsat data. Certainly in view of the larger standard errors here (a consequence 
of the smaller sample size) this is very nice. Turning to the ability estimates 
we find that there were no subjects with total scores larger than 7. Therefore I 
was only able to calculate Monte Carlo estimates for 9 for T = 1 up to T = 7. 
This is in contrast to the usual estimation method: upon assuming all the item 
parameters are known, it is no problem to calculate ability estimates for any 
value of T, whether this value actually appears in the data or not. Again the 
differences are very small. 

Computer times and storage 

The storage required for this estimation procedure is huge. A tensor of approxi- 
mately size M x M x B has to be stored, in which B is the number of Monte 
Carlo samples, and M is the number of items. Storage is necessary because 
the maximisation of every log L* needs several iterations, in each of which the 
Monte Carlo data appear, together with different values for the parameters. If 
there are items with the same value for S, the size of the array can be reduced 
somewhat because of the fact that Monte Carlo responses for those items viii 
only appear in the equations added up together (i.e. Y ; 's instead of Xj % &) % so 
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they might as well be stored added up together. In the case of estimating 0 this 
is no trivial reduction, because otherwise the size of the array would have been 
of order M x N x B. 

For the three data sets I have examined, more than 90% of the computer 
time, perhaps even 99%, was spent in generating the response patterns. Once 
the response patterns were there, it was usually a matter of seconds, or even 
less, to see the actual estimates appearing on the screen. 

Insert Table 5 about here 

This is reflected in table 5 in which the CPU-times for the estimates reported 
in the previous section are given. The times are for a small UNIX machine. For 
the Isat data, the CPU-times for calculating ability estimates are much larger 
than the times for calculating item difficulties. This is because there are nearly 
700 persons in the sample, so that in estimating 6 for each item 700 responses 
have to be generated; whereas in estimating S only 5 items are involved. Clearly 
some work has to be done to see if these times can be reduced. One suggestion 
could be to reduce the number of Monte Carlo estimation cycles, or the size of 
the Monte Carlo samples, especially in estimating 6. At present I use an initial 
sample size of 500; when the squared Euclidean distance between two conse- 
cutive sets of parameter estimates becomes less than .10, I switch to a Monte 
Carlo sample size of 2000 and do 2 more maximisation cycles. In estimating 0 
however the precision reached after only 1 cycle with B = 500 was hardly ever 
increased by subsequent cycles with B = 2000. Considering the fact that for 
the estimation of ability parameters the simulated 'response vector' is of length 
N i it is not very surprising that these estimates are better, for the same value of 
B y than those of the item parameters, which are based on a response vector of 
length Af. So for the ability parameters a substantial reduction in the amount 
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of CPU-time might very well be possible. For the estimation of 6 I am not so 
sure a very large reduction is possible (apart from improvements in the style of 
programming). The switch to B = 2000 was always made after 1 or 2 cycles 
with B = 500, but sometimes the first cycle with B = 2000 would then again 
show a squared Euclidean distance of for example .14. Again, a suitable topic 
for further work. 



The estimation method examined in this paper seems to work, at least for small 
data sets. The approximate estimates produced are not too far away from the 
exact values. However, many things still need to be done; some of them I have 
not even started with. Here they come, not in any particular order. 



where pj is the autocorrelation for lag j. As pointed out before, the autocorrela- 
tion cannot a priori be assumed to be zero. Two methods for investigating this 

(6) 

are, first, similar sequential plots for the estimated variance of d t (6) as the 
ones for d t {sf b) itself. The second method would be to calculate the variance 
of d t (6) for different numbers of scans between two successive simulations; 
this should obviously reduce the autocorrelation. 

The parameter estimates produced in section 3 should be reproduced a large 
number of times with different values for the seed of the random number gene- 
rator, to get some additional insight into the average and standard deviation of 
the Monte Carlo estimates. 



Conclusion and discussion 



To begin with, the variance of the estimator d t (6) needs some more at- 
tention. According to Hastings (1970) this variance is equal to 
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The variance/covariance matrix of the parameter estimates has to be cal- 
culated as well. Probably it is possible to approximate the matrix of second 
derivatives by a method similar to the one used to get the parameter estimates, 
and then to invert this matrix. But possibly this will be more involved than 
approximating only the parameter estimates. 

The conditions under which the estimator d t {6) performs well need some 
more attention. Especially the difference that seems to exist between the fin- 
dings in this paper and the ones by Geyer and Thompson, concerning the influ- 
ence of the distance from S to t/>, could be examined further. 

The estimation process needs some more attention as well, especially the- 
oretically: can it be expected that the results of unconstrained maximisation 
will be similar to those using one fixed parameter? Some preliminary analyses 
seem to suggest that unconstrained maximisation (and later rescaling to a mean 
of zero) gives results similar to the ones found above, the only difference being 
slightly larger computer times. 

All the estimations done until now were on very small data sets. It is neces- 
sary to get some indication of the precision of the method for larger data sets 
as well. The exactness of the ability estimates are promising in this respect: in 
a way, these can be compared to estimates for very long tests (taken by only a 
few people). However, both the CPU-time and storage needed for larger item 
sets will increase with the number of items. But perhaps this problem will not 
become too serious, though: recall that for estimating the parameters of a larger 
test probably the number of Monte Carlo samples can be substantially reduced. 
This will have an impact on both CPU and storage requirements. 

The above leads me to conclude that although the method works, a lot of 
work still needs to be done to make it suitable for CML estimation in models 
where exact CML estimates are impossible to obtain. 
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Table 1: Chi-square goodness of fit values for the distribution of gene- 
rated response patterns. 



t 


df 


X 2 values for 
B=500 B=1000 


B=2000 


1 


4 


3.94 


2.00 


2.21 


2 


9 


1.88 


12.83 


11.71 


3 


9 


11.19 


8.23 


7.76 


4 


4 


0.27 


11.86 


8.01 



t=totaI score, df=degrees of freedom. 

B = number of generated response patterns. 
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Table 2: Relative accuracy of the estimator d t {6). Results for 1000 
replications. 





aist 




mean 


Ml 


1X1111 




500 


0.1 


1 


-0.0013 


0.022 


-0.043 


0.040 






2 


-0.0015 


0.037 


-0.066 


0.069 






3 


-0.0015 


0.047 


-0.079 


0" : 






4 


-0.0013 


0.054 


-0.091 


0. 




1.0 


1 


-0.012 


0.20 


-0.46 


0.53 






2 


0.021 


0.36 


-0.55 


1.02 






3 


0.080 


0.49 


-0.60 


1.31 






4 


0.135 


0.59 


-0.63 


1.56 




3.0 


1 


-0.074 


0.45 


-0.94 


3.55 






2 


0.178 


1.07 


-0.94 


7.79 






3 


0.821 


2.16 


-0.95 


11.41 






4 


1.678 


3.57 


-0.95 


15.98 


2000 


0.1 


1 


0.0008 


0.021 


-0.041 


0.039 






2 


0.0018 


0.036 


-0.065 


0.067 






3 


0.0028 


0.046 


-0.081 


0.084 






4 


0.0035 


0.053 


-0.091 


0.099 




1.0 


1 


-0.014 


0.21 


-0.47 


0.45 






2 


0.030 


0.36 


-0.56 


0.90 






3 


0.088 


0.49 


-0.60 


1.23 






4 


0.145 


0.60 


-0.62 


1.55 




3.0 


1 


-0.092 


0.47 


-0.92 


2.51 






2 


0.146 


0.99 


-0.95 


5.42 






3 


0.697 


1.99 


-0.95 


11.31 






4 


1.478 


3.29 


-0.95 


15.91 



B Number of simulations. 

dist order of distance from t/» to 6, a distance of d means 6 t — ip } < d. V; 

t=total score, mean = average relative error of di{6), sd=standard deviation of average relative error 
rnin = minimum observed value of relative error; max = maximum observed value of relative error 
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Table 3; Item and ability parameter estimates for lsat data. 



data 


item 


mc-it 


cml 


se 


diff 


t 


mc-ab 


cml 


se 


diff 


lsat6 


1 


-1.20 


-1.26 


.12 


.06 


1 


-1.74 


-1.60 


1.18 


-.14 




2 


-.70 


-.62 


.10 


-.08 


2 


-.52 


-.47 


.99 


-.05 




3 


.23 


.17 


.09 


.06 


3 


.53 


.48 


.99 


.03 




4 


.43 


.47 


.08 


-.04 


4 


1.73 


1.60 


1.18 


.12 




5 


1.24 


1.24 


.08 


.00 












lsat 7 


1 


-.79 


-.67 


.10 


-.12 


1 


-1.53 


-1.44 


1.14 


-.09 




2 


-.43 


-.54 


.10 


.11 


2 


-.46 


-.44 


.95 


-.02 




3 


-.08 


-.13 


.09 


.05 


3 


.45 


.44 


.95 


.01 




4 


.52 


.54 


.08 


-.02 


4 


1.54 


1.44 


1.15 


.10 




5 


.78 


.81 


.08 


-.03 













mc-it. nionte carlo estimate for item parameters, mc-ab monte carlo estimate for ability parameters, 
cml exact cml estimates; se estimated standard errors for cml t. total score. 
difT montecarlo-estimates - cml-estimates. 
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Table 4: Item and ability parameter estimates for the memory data. 



item 


mc-est 


cml 


se 


diff 


t 


mc-est 


'cml' 


se 


diff 


1 


-2.42 


-2.49 


.46 


.07 


1 


-2.81 


-2.71 


1.16 


-.10 


2 


-1.04 


-1.02 


.36 


-.02 


2 


-1.68 


-1.69 


.90 


.01 


3 


-.85 


-.91 


.36 


.06 


3 


-.98 


-1.00 


.78 


.02 


4 


.10 


.05 


.39 


.05 


4 


-.39 


-.43 


.73 


.04 


5 


.31 


.33 


.40 


-.02 


5 


.10 


.08 


.70 


.02 


6 


.44 


.49 


.42 


-.05 


6 


.59 


.57 


.70 


.02 


7 


.44 


.49 


.42 


-.05 


7 


1.10 


1.09 


.74 


.01 


8 


.60 


.66 


.43 


-.06 


8 




1.69 


.83 




9 


1.16 


1.07 


.48 


.09 


9 




2.56 


1.80 




10 


1.28 


1.33 


.52 


-.05 













mc-it. monte carlo estimate of item parameters; mc-ab. monte carlo estimate of ability parameters, 
cml exact cml estimates, se estimated standard errors for cmi t: total score 
diff. montecarlo-estimate - cml-estimate 
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Table 5: Time needed for computations. 



data cpu-time real time 



item parameters 


lsat6 


139 


2:37 




lsat7 


148 


2:45 




memory 


458 


8:44 


person parameters 


IsatG 


1596 


28:21 




lsat7 


1713 


33:26 




memory 


665 


11:14 



cpu-time = number of central processor units, 
real time = elapsed time in minutes and seconds. 
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Figure Caption 



Sequential estimates d t (6) for some values of t, for B=500 and B=2000. 

Memory data; distance between 6 } and t/-, < 3, Vj 
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